Avoiding Scraper Traps
The requests library makes it easy to handle forms on websites, and it is also good at setting headers. HTTP headers are sets of attributes and preferences sent by you every time you make a request to a web server.
The headers sent by a typical Python scraper using the default urllib library might look like this:
Accept-Encoding: identity
User-Agent: Python-urllib/3.4
A good website for testing which browser properties are viewable by servers is https://www.whatismybrowser.com .
Usually the one setting that really matters when websites check for “humanness” is User-Agent.
Changing headers also brings a lot of convenience.
Let’s say you need some material in Chinese: simply change Accept-Language: en-US to Accept-Language: zh.
Mobile devices are often served a different version of a web page, so setting a mobile User-Agent such as:
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257
can bring a great change in what the server returns.
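As a minimal sketch of putting this together, the snippet below sends custom headers with requests. The target https://httpbin.org/headers is my choice, not from the book; it is simply a convenient echo service that returns the headers it received, so you can verify what a server actually sees:

import requests

# Custom headers so the request looks like a mobile browser
# asking for Chinese-language content.
headers = {
    "User-Agent": ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) "
                   "AppleWebKit/537.51.2 (KHTML, like Gecko) "
                   "Version/7.0 Mobile/11D257"),
    "Accept-Language": "zh",
}

# httpbin echoes back the headers it received.
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.text)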
Handling Cookies
Cookies can keep you logged in on a site.
There are a number of browser plug-ins that can show you how cookies are being set as you visit and move around a site; EditThisCookie ( https://www.editthiscookie.com/ ), a Chrome extension, is a very good one.
The requests library is unable to handle many of the cookies produced by modern sites, because they are set by client-side JavaScript, which requests does not execute. To deal with those, use the Selenium and PhantomJS packages.
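A minimal sketch of inspecting such cookies, assuming Selenium is installed and the PhantomJS binary is on your PATH (the URL below is just a placeholder):

from selenium import webdriver

# PhantomJS is a headless browser, so client-side JavaScript runs
# and JavaScript-set cookies actually get created.
driver = webdriver.PhantomJS()
driver.get("https://example.com")
driver.implicitly_wait(1)  # give client-side scripts a moment to finish

# Print every cookie the site has set in this session.
print(driver.get_cookies())

driver.close()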
Timing Is Everything
Even though multithreaded jobs can sometimes make your scraper faster than a single thread, keep individual page loads and data requests to a minimum, and try to space them out by a few seconds:
time.sleep(3)
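For example, a simple polite-crawling loop might look like this (the URLs are placeholders):

import time
import requests

# Hypothetical list of pages to fetch; the point is the pause between requests.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(3)  # wait a few seconds so requests are not fired in rapid bursts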
Reference:
Book: Web Scraping with Python: Collecting Data from the Modern Web, by Ryan Mitchell (O’Reilly).